An Information-Theoretic External Cluster-Validity Measure

Author

  • Byron Dom
Abstract

In this paper we propose a measure of similarity/association between two partitions of a set of objects. Our motivation is the desire to use the measure to characterize the quality or accuracy of clustering algorithms by somehow comparing the clusters they produce with "ground truth" consisting of classes assigned by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also allows clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels are as predictors of their class labels. It computes the reduction in the number of bits that would be required to encode (compress) the class labels if both the encoder and decoder have free access to the cluster labels. To achieve this encoding the estimated conditional probabilities of the class labels given the cluster labels must also be encoded. In addition to defining the measure we compare it to other commonly used external measures and demonstrate its superiority as judged by certain criteria.

1 The Clustering Problem

The most common unsupervised-learning problem is clustering, in which we are given a set $\Omega$ of objects or patterns, $\Omega = \{\omega_i \mid i = 1, 2, \ldots, n\}$, and each object has a representation $x_i := x(\omega_i)$ in some feature space, which is frequently treated as an $m$-dimensional continuum $\mathbb{R}^m$. Some of the features may be categorical, however. The goal in clustering is to group the objects by grouping their associated feature vectors $X = \{x_i \mid i = 1, 2, \ldots, n\}$. This grouping can be based on any number of criteria. It is assumed that the dimensions of $x$ are attributes relevant to some application of interest. The grouping is performed on the basis of some measure of similarity relevant to the application and associated feature space.
There are numerous objective functions and algorithms for clustering (see [JD88] for a survey), but we are not concerned with these here. Our task is to devise a measure of the quality of the output of clustering algorithms. Let $K = \{k_i \mid i = 1, 2, \ldots, n\}$ be a set of cluster labels assigned to the elements of $X$. The labels themselves are taken from a set $\mathcal{K}$, where $|\mathcal{K}|$ is the number of clusters. We have some clustering procedure $f$ that maps $X$ to $K$.

Definition: clustering procedure $f$:

$f : X(\Omega) \to K(\Omega)$    (1)

The procedure $f$ may determine the optimal number of clusters as well as the assignment of feature vectors (objects) to cluster labels, or it may accept the number of clusters as input. The set $\Omega$ can be considered to have been drawn from some larger population, which can be characterized by a probability density $p(x)$. The combination of $p(x)$ and the clustering procedure $f$ results in a probability distribution $\{p(k)\}$ over cluster labels.

We define three clustering problems: (1) Each pattern is assigned to one and only one cluster (so-called partitional clustering). (2) Each pattern may be assigned to multiple clusters. These are binary assignments. (3) Each pattern has a degree of membership in each cluster. The measure we propose applies to partitional clustering. In addition to these three categories, a distinction can be made between flat and hierarchical clustering (although flat clustering is technically a special case of hierarchical clustering, i.e., a depth-one tree). Our measure applies to flat clustering.
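As a rough illustration of the idea stated in the abstract, the sketch below estimates how many bits per object are saved when class labels are encoded with access to the cluster labels of a flat, partitional clustering. It uses only the empirical entropy H(C) and conditional entropy H(C | K). This is an assumption-laden simplification: it leaves out the cost of encoding the estimated conditional probabilities, which the abstract says must also be counted, so it is not the paper's full measure. The function names and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log2


def label_distribution(labels):
    """Empirical distribution over labels, e.g. {p(k)} for cluster labels."""
    n = len(labels)
    return {label: count / n for label, count in Counter(labels).items()}


def entropy_bits(labels):
    """Empirical entropy of a label sequence, in bits per object."""
    return -sum(p * log2(p) for p in label_distribution(labels).values())


def conditional_entropy_bits(class_labels, cluster_labels):
    """Empirical H(C | K): bits per object to encode classes given clusters."""
    n = len(class_labels)
    joint = Counter(zip(class_labels, cluster_labels))
    cluster_counts = Counter(cluster_labels)
    return -sum(
        (n_ck / n) * log2(n_ck / cluster_counts[k])
        for (_, k), n_ck in joint.items()
    )


def bits_saved(class_labels, cluster_labels):
    """Reduction in bits per object, ignoring the model-encoding cost."""
    return entropy_bits(class_labels) - conditional_entropy_bits(
        class_labels, cluster_labels
    )


# Toy data (hypothetical): four objects, two ground-truth classes.
classes = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]   # clusters coincide with the classes
useless = [0, 1, 0, 1]   # clusters carry no information about the classes

print(bits_saved(classes, perfect))
print(bits_saved(classes, useless))
```

On this toy data the clustering that matches the classes saves the full one bit per object, while the uninformative clustering saves nothing. Because the sketch ignores the cost of encoding the conditional probabilities, it would tend to favor clusterings with many small clusters, which is the effect the extra encoding term described in the abstract is meant to counteract.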


Similar resources

Development of An External Cluster Validity Index using Probabilistic Approach and Min-max Distance

Validating a given clustering result is a very challenging task in the real world. For this purpose, several cluster validity indices have been developed in the literature. Cluster validity indices are divided into two main categories: external and internal. External cluster validity indices rely on some supervised information available and internal validity indices utilize the intrinsic structu...


AN INFORMATION-THEORETIC EXTERNAL CLUSTER-VALIDITY MEASURE Byron

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside...


Extending k-Representative Clustering Algorithm with an Information Theoretic-based Dissimilarity Measure for Categorical Objects

This paper aims at introducing a new dissimilarity measure for categorical objects into an extension of k-representative algorithm for clustering categorical data. Basically, the proposed dissimilarity measure is based on an information theoretic definition of similarity introduced by Lin [15] that considers the amount of information of two values in the domain set. In order to demonstrate the ...


Entropy-Based Cluster Validation and Estimation of the Number of Clusters in gene Expression Data

Many external and internal validity measures have been proposed in order to estimate the number of clusters in gene expression data, but as a rule they do not consider the analysis of the stability of the groupings produced by a clustering algorithm. Based on the approach assessing the predictive power or stability of a partitioning, we propose a new measure of cluster validation and the selec...


Unsupervised Feature Selection by Means of External Validity Indices

Feature selection for unsupervised data is a difficult task because a reference partition is not available to evaluate the relevance of the features. Recently, different proposals of methods for consensus clustering have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. These indices are independent of the...


A Game-theoretic Approach for Enumerating Dominant Sets

Recently, a novel graph-theoretic notion of a cluster has been proposed, i.e., the “dominant set”, which captures the two basic requirements of a cluster, namely internal coherency and external incoherency. In this paper, we tackle the problem of finding several dominant sets using the replicator dynamics. Specifically, we adopt a game-theoretic perspective to this iterative extraction: Game dy...




Publication date: 2002